🤖 feat: harden CoderControlPlane controller (leader election, readiness, CRD defaults, tests)#41
Conversation
Regenerated after: - CRD default markers (image, replicas, service type/port) - RBAC markers for coordination.k8s.io/leases and events
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e455d98ec7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Detect pod namespace from the in-cluster service account file, falling back to kube-system for out-of-cluster development runs. Without an explicit namespace, controller-runtime fails at startup when not running inside a pod.
|
@codex review Fixed the |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9063e50cb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Add +kubebuilder:default={} on Service fields so nested defaults
(type, port) apply even when spec.service is omitted entirely.
- Add POD_NAMESPACE env var as first-priority leader election namespace
source, enabling explicit override for out-of-cluster dev runs.
|
@codex review Addressed both P2 comments:
|
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Close operational and test gaps in the CoderControlPlane controller, making it safer for HA deployments, more observable, and harder to regress.
Background
The CoderControlPlane controller audit identified several best-practice gaps:
kubectl get -o yaml.Implementation
1. Leader election (
controllerapp.go)LeaderElection: truewith stable IDcoder-k8s-controller.coder.com.LeaderElectionReleaseOnCancel: truefor faster handoff on shutdown.coordination.k8s.io/leasesandevents.2. Cache-sync readiness check (
controllerapp.go)/readyzfromhealthz.Pingto a cache-sync check usingmgr.GetCache().WaitForCacheSyncwith a 2-second timeout./healthzremainshealthz.Ping(liveness should be cheap).3. CRD default markers (
api/v1alpha1/)+kubebuilder:defaultmarkers on:CoderControlPlaneSpec.Image→"ghcr.io/coder/coder:latest"CoderControlPlaneSpec.Replicas→1ServiceSpec.Type→"ClusterIP"ServiceSpec.Port→804. Regenerated manifests
config/crd/bases/).config/rbac/role.yaml).5. Expanded envtest coverage (5 new tests)
TestReconcile_StatusPersistence: asserts ObservedGeneration, URL, ReadyReplicas, Phase after reconcile.TestReconcile_OwnerReferences: verifies Deployment/Service have controller owner refs.TestReconcile_SpecUpdatePropagates: updates spec fields, reconciles, asserts child resources updated.TestReconcile_PhaseTransitionToReady: manually sets deployment ready replicas, asserts phase transition to Ready.TestReconcile_DefaultsApplied: reconciles minimal spec, asserts defaults in child resources.Validation
make verify-vendor✅make build✅make lint✅make test✅ (all packages pass, including 5 new envtest tests)Risks
📋 Implementation Plan
Plan: Close CoderControlPlane controller best-practice gaps
Context / Why
The
CoderControlPlanecontroller works, but the audit identified operational and test gaps that make it harder to run safely in production (HA), harder to debug, and easier to regress:/readyzand/healthzare bothhealthz.Ping(readiness doesn’t reflect cache/API readiness).CoderControlPlanetests cover creation and defensive nil checks, but miss spec updates, status assertions, and key error paths.Goal: harden the controller app + reconciler and expand tests so the behavior is safer and the contract is verified.
Evidence (what was inspected)
internal/app/controllerapp/controllerapp.goScheme+HealthProbeBindAddress(noLeaderElection).healthz.Ping.internal/controller/codercontrolplane_controller.goObservedGeneration,ReadyReplicas,URL,Phase.internal/controller/codercontrolplane_controller_test.gointernal/controller/suite_test.goenvtestwith CRDs loaded fromconfig/crd/bases.internal/controller/workspaceproxy_controller_test.gocontains richer assertions (including status).Implementation details
1) Enable leader election for the controller manager
Why: safe HA operation; prevents multiple controllers racing on the same resources.
Edits
internal/app/controllerapp/controllerapp.goctrl.Options.LeaderElectionID.LeaderElectionReleaseOnCancel: truefor faster handoff on shutdown.Suggested shape:
RBAC (required if leader election is enabled)
controller-gen(e.g.,internal/app/controllerapp/controllerapp.go):// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch;deletemake manifests) soconfig/rbac/role.yamlincludesleasespermissions.deploy/rbac.yaml,config/e2e/*) bind the controller SA to the regenerated role (or update them if they’re intended to be used).2) Make readiness meaningful (cache sync instead of Ping)
Why: Kubernetes should only route traffic / treat the controller as ready once caches are started and synced.
Edits
internal/app/controllerapp/controllerapp.gohealthz.Ping.mgr.GetCache().WaitForCacheSync). Ensure it does not hang indefinitely by using a short timeout.Suggested shape:
3) Move defaults out of “hidden” controller logic (prefer CRD defaults)
Why: users and automation should be able to see the effective defaults via
kubectl get -o yamland avoid surprise behavior.Preferred approach (low overhead): CRD defaults via kubebuilder markers
+kubebuilder:defaultmarkers to API types.api/v1alpha1/codercontrolplane_types.goCoderControlPlaneSpec.Imagedefaultghcr.io/coder/coder:latestCoderControlPlaneSpec.Replicasdefault1api/v1alpha1/types_shared.goServiceSpec.TypedefaultClusterIPServiceSpec.Portdefault80Example:
make manifeststo update CRDs underconfig/crd/bases/.Alternative: mutating webhook defaulting
If CRD defaults for nested structs (like
spec.service) are insufficient or too awkward, a mutating webhook can set defaults reliably. This is higher operational overhead (webhook server, certs, deployment changes) and is likely not worth it unless defaults become complex.4) Expand envtest coverage for CoderControlPlane
Why: prevent regressions in update behavior and status reporting.
Edits
internal/controller/codercontrolplane_controller_test.goAdd tests mirroring the thoroughness of
workspaceproxy_controller_test.go:Status is persisted after reconcile
GettheCoderControlPlaneand assert:Status.ObservedGeneration == GenerationStatus.URL == fmt.Sprintf("http://%s.%s.svc.cluster.local:%d", ...)Status.ReadyReplicas == 0(in envtest)Status.Phase == PendingOwner references are set (in lieu of GC tests)
Deployment.OwnerReferencesandService.OwnerReferencesinclude the CP withcontroller=true.Spec updates propagate
Status transitions to Ready when deployment is “ready”
deployment.Status.ReadyReplicas = 1viak8sClient.Status().Update.Status.Phase == ReadyandReadyReplicas == 1.Defaulting behavior is covered
5) Add targeted error-path tests (unit-style)
Why: ensure reconcile wraps and returns actionable errors (especially around status writes).
Approach:
client.Clientwrapper used only in tests to force failures on:Status().Update(to coverupdate control plane status: ...path)Createfor Deployments/Services (to cover reconcile child errors)This avoids needing a full fake API server while still verifying error-handling behavior.
6) Validation / regeneration checklist
When implementing:
make manifests(CRD + RBAC generation)make testmake lintmake buildExpected outcome
Generated with
mux• Model:anthropic:claude-opus-4-6• Thinking:xhigh• Cost:$0.57